ABSTRACT


Video stabilization has been widely researched and remains under active
research given the ongoing advances in digital imaging. Despite this body
of work, hardware based real time video stabilization systems, and in
particular works dealing with implementation on prototyping boards, have
been very few. This review focuses on video stabilization systems
implemented on prototyping boards and on the algorithms that have been
found suitable for such implementation, taking cost, size and speed as
the principal criteria.

1. INTRODUCTION

Digital Video Stabilization (DVS) has been the more favored technique
in recent times compared to its more accurate but expensive
counterparts, mechanical and optical stabilization. This is primarily
because software based implementations provide a viable alternative to
solutions that need costly and bulky hardware assemblies such as
sensors, gyroscopes and lenses, making DVS based setups portable and
cost effective. A generic DVS system consists of three major blocks, as
shown in Fig 1. The first block is motion estimation, where the global
motions between the frames of the video are extracted. In the second
block, motion correction, the intentional motions are separated from
the global motions. The last block deals with image correction and
produces the final stabilized video using the estimated unintentional
motions.


Motion estimation is the most time consuming and difficult part of
digital stabilization, and it largely determines the operational speed
of the stabilization system. For real time stabilization, computational
speed is of critical importance, and it depends on how computationally
intensive the employed algorithm is. Motion estimation, being the most
complicated stage, therefore needs the most attention in order to
achieve real time stabilization. Various mechanisms have been used to
implement the motion estimation stage, with differing computation
speeds, complexity levels and degrees of accuracy. The paper is
organized as follows. The first part discusses the working details of
video stabilization techniques. The second part discusses the works and
stabilization techniques that have been utilized for realtime
applications, especially prototyped implementations. The last part
discusses the realtime performance of feature based descriptors like
SIFT and SURF and their associated advantages.

Fig 1 Generic Video Stabilization System


2. Motion Estimation Techniques 

Motion estimation techniques are of four types in general:
1. Gradient based methods 

2. Pel Recursive Techniques 

3. Block based Matching methods 

4. Feature Matching Methods 


2.1. Gradient techniques 

For image sequence analysis based applications, gradient techniques
can be employed. Gradient techniques make use of the optical flow
constraint equation [1,2] in order to find the motion vector v at the
position r, subject to some additional constraints. For example, the
Horn-Schunck method [3] minimizes the squared magnitude of the optical
flow gradient, which acts as a smoothness constraint.

Some preliminary works employing gradient mechanisms have been done by
Toshiaki Kondo, Pramuk Boonsieng et al. [4,5], wherein they discussed
two conventional gradient based motion estimation techniques. Since
unit gradient vectors have the advantage of being insensitive to
varying image intensity, the methods suggested in [4,5] were further
improved in [5] by using unit gradient vectors rather than image
intensities. The works resulted in motion estimation techniques that
are more robust to irregular lighting conditions than conventional
systems.


In [6], information from consecutive frames was used to account for
the trans-rotational motions. The work used an optical flow based
affine model for motion estimation using the Horn-Schunck algorithm. A
model fitter was then used to stabilize the video sequences.
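
For reference, the optical flow constraint and the Horn-Schunck
functional mentioned in this section can be written in their standard
textbook form as below. This is a sketch in common notation, where I is
the image intensity, (u, v) are the components of the motion vector v,
and alpha is a regularization weight. These symbols are assumptions and
may differ from the notation of the original equations.

```latex
\begin{align}
% Optical flow constraint: brightness is assumed constant along the motion
\nabla I(\vec{r}, t) \cdot \vec{v}(\vec{r}, t)
  + \frac{\partial I(\vec{r}, t)}{\partial t} &= 0 \\
% Horn-Schunck: data term plus smoothness term on the flow field
E(u, v) &= \iint \Big[ \big( I_x u + I_y v + I_t \big)^2
  + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big]
  \, dx \, dy
\end{align}
```

The first line alone gives one equation for two unknowns per pixel,
which is why an additional constraint such as the smoothness term in
the second line is needed.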


2.2. Pixel recursive Techniques 

As a variation of gradient techniques, pixel recursive techniques
were introduced. A pixel recursive technique is basically an iterative
gradient mechanism which minimizes the prediction error given by the
Displaced Frame Difference (DFD). It has its roots in the
Netravali-Robbins method [7], which works on the principle of
recursively updating the displacement estimate d according to the
formula


$$\hat{d}^{k+1} = \hat{d}^{k} - \varepsilon \, DFD(\vec{r}, t, \hat{d}^{k}) \, \nabla I(\vec{r} - \hat{d}^{k}, t - \Delta t) \qquad (3)$$
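
The recursion in (3) can be illustrated with a minimal one dimensional
sketch. The analytic test signal, the step size `epsilon` and the
function names below are assumptions chosen for illustration and are
not taken from [7]. The DFD is taken as the current intensity minus the
displaced intensity of the previous frame.

```python
import math


def frame_prev(x):
    """Intensity of the previous frame at continuous position x (toy signal)."""
    return math.sin(0.3 * x)


def frame_prev_gradient(x):
    """Spatial derivative of the previous frame at position x."""
    return 0.3 * math.cos(0.3 * x)


def pel_recursive_displacement(cur_value, x, epsilon=5.0, iterations=100):
    """Iterate d <- d - epsilon * DFD * gradient, starting from d = 0."""
    d = 0.0
    for _ in range(iterations):
        # Displaced frame difference for the current estimate d.
        dfd = cur_value - frame_prev(x - d)
        # Gradient step on the squared DFD, as in equation (3).
        d = d - epsilon * dfd * frame_prev_gradient(x - d)
    return d


TRUE_SHIFT = 0.5
x = 2.0
# The current frame is the previous frame displaced by TRUE_SHIFT.
cur_value = frame_prev(x - TRUE_SHIFT)
d_est = pel_recursive_displacement(cur_value, x)
print(d_est)
```

In this toy setting the estimate approaches the true displacement
because the signal is smooth and the starting point is close. On real
images the step size has to be chosen carefully and the recursion can
stall in flat regions where the gradient is near zero.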


2.3. Block Matching Techniques 

Block matching techniques [8] work on the minimization of a
dissimilarity measure. In other words, blocks in the current frame are
matched with those in the previous frame. The best prediction is found
by matching the current block against all blocks in the search area,
which is known as the full search algorithm. If the algorithm uses MSE
as the matching criterion, then for a block size of 16x16 every
candidate comparison requires 256 subtractions, 256 multiplications
and 255 additions, which is fairly resource intensive [9]. This is the
primary reason the full search algorithm is rarely used, especially in
real time applications.
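
A minimal sketch of full search block matching with the MSE criterion
is given below. The synthetic texture, the wrap-around shift and all
function names are assumptions made so that the example is self
contained. They are not part of any cited implementation.

```python
def make_frame(width, height):
    """Deterministic synthetic texture used as the previous frame."""
    return [[(3 * x * x + 5 * y * y + 7 * x * y) % 251 for x in range(width)]
            for y in range(height)]


def shift_frame(frame, dx, dy):
    """Return the frame translated by (dx, dy) pixels, with wrap-around."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def block_mse(cur, prev, bx, by, u, v, n):
    """MSE between the current block at (bx, by) and the previous-frame
    block displaced by the candidate vector (u, v)."""
    total = 0
    for j in range(n):
        for i in range(n):
            diff = cur[by + j][bx + i] - prev[by + v + j][bx + u + i]
            total += diff * diff
    return total / (n * n)


def full_search(cur, prev, bx, by, n=16, p=7):
    """Evaluate all (2p + 1)^2 candidates and keep the lowest MSE."""
    best_vec = (0, 0)
    best_cost = block_mse(cur, prev, bx, by, 0, 0, n)
    for v in range(-p, p + 1):
        for u in range(-p, p + 1):
            cost = block_mse(cur, prev, bx, by, u, v, n)
            if cost < best_cost:
                best_vec, best_cost = (u, v), cost
    return best_vec, best_cost


prev = make_frame(48, 48)
cur = shift_frame(prev, 3, -2)          # scene moved 3 px right, 2 px up
vector, cost = full_search(cur, prev, 16, 16)
print(vector, cost)
```

With p = 7 the search visits 225 candidate positions per block, each
costing a full 16x16 comparison, which shows why faster search patterns
are preferred for real time use. The returned vector points from the
current block back to its position in the previous frame.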


Fig 2 Block matching of a macro block of size 16 x 16 pixels with a
search parameter p of 7 pixels.


Three step search (TSS) is a reduced version of the full search
algorithm. It starts at the center of the search area and evaluates
search locations on a window with a step size of 4, which is then
refined in subsequent steps. Three motion estimation algorithms have
been used in [10, 11]. The authors in [10] proposed a modified adaptive
rood path search (N-ARPS) algorithm with small motion prejudgment
(SMP), which decides on the block search strategy by taking the motion
properties into account. Computational efficiency in [12] was improved
using the three step search (TSS) along with GCBP matching, which
performed an efficient search during correlation measure calculation.
Another variation of block based searching uses the Hexagon based
search algorithm (HEXBS) [14], which follows the same search strategy
as diamond search but uses a hexagonal search pattern in place of the
diamond. An adaptation of HEXBS was used by [14], wherein the direction
of motion was predicted by a rood shaped pattern combined with hexagon
based search for refining the search process.


In the context of realtime implementations, BBGDS (block based
gradient descent search) and diamond search (DS) have been seen to
outperform other algorithms by requiring fewer computations. The DS
algorithm consists of a large diamond search pattern (LDSP) and a small
diamond search pattern (SDSP). The LDSP is applied repeatedly until the
best match occurs at its center, after which the SDSP refines the
result. Simulation results reported in the field show that DS
outperforms TSS and matches NTSS in terms of compensation error, with
an improvement in computational cost of the order of 20 to 25 percent.
DS has shown better prediction quality and lower complexity than TSS
and 2D log search, making it a preferred choice for software
implementations. For hardware implementation, however, the two diamond
sizes make the control circuitry slightly more complex. Another
noteworthy work addressing this issue, with stabilization as the end
application, has been done by Song, Ma et al. [16], wherein they
incorporated the Hooke-Jeeves algorithm into the traditional diamond
search based fast block matching method, which as per their results
significantly improved the efficiency of motion estimation. With 36
blocks (8x8 pixels) selected in each frame, the proposed method took
24.79 ms against 30.99 ms for DS, measured as the average processing
time per frame over 10 consecutive frames. For a sample set of 49
blocks, it again fared better with 25.39 ms versus 31.79 ms. When the
number of blocks was increased to 64, the proposed method (35.47 ms)
remained faster than DS (41.35 ms) while still being a real time video
stabilization method.
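
The two stage LDSP and SDSP procedure described above can be sketched
as follows. This is a minimal illustration, not the implementation of
[16]. The function name `diamond_search`, the generic `cost` callable
and the quadratic test surface `bowl` are assumptions. In a real system
`cost(u, v)` would be the block matching error for candidate vector
(u, v).

```python
# Offsets of the large (9 point) and small (5 point) diamond patterns.
LDSP = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
        (0, 2), (-1, 1), (-2, 0), (-1, -1)]
SDSP = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]


def diamond_search(cost, p=7):
    """Return the best vector found and the number of cost evaluations."""
    cache = {}

    def cached_cost(point):
        # Each candidate position is evaluated at most once.
        if point not in cache:
            cache[point] = cost(*point)
        return cache[point]

    def candidates(center, pattern):
        cx, cy = center
        return [(cx + dx, cy + dy) for dx, dy in pattern
                if abs(cx + dx) <= p and abs(cy + dy) <= p]

    center = (0, 0)
    while True:
        # The center is listed first, so ties keep the current center.
        best = min(candidates(center, LDSP), key=cached_cost)
        if best == center:
            break
        center = best
    # Final refinement with the small diamond around the LDSP minimum.
    best = min(candidates(center, SDSP), key=cached_cost)
    return best, len(cache)


def bowl(u, v):
    """Toy unimodal error surface with its minimum at (3, -2)."""
    return (u - 3) ** 2 + (v + 2) ** 2


best_vector, evaluations = diamond_search(bowl)
print(best_vector, evaluations)
```

On this smooth surface the search reaches the minimum after evaluating
far fewer points than the 225 candidates of a full search with p = 7.
Real matching error surfaces are not guaranteed to be unimodal, so DS
can settle on a local minimum.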


2.4. Feature Matching Methods 

Feature matching works by identifying points in the scene that are
easily recognizable. Here, motion estimation can be performed by
computing the displacements of these points of interest across the
video, frame by frame, and by tracking their positions using the
properties of the selected features, forming trajectories. One
significant advantage is that the same point can be tracked and
recognized across many frames. One of the more commonly used feature
detection algorithms is the Kanade-Lucas-Tomasi (KLT) feature tracker
[52], which has been used to track features across videos in several
methods [8,16,20,21,32,34]. The Harris corner operator is another
feature detector used for this purpose [53]. These features are then
tracked using an optical flow technique.


SIFT based mechanisms have been widely used
[54,38,40,41,43,45,33]. These features use descriptors based on the
image gradient, which are very distinctive and give very reliable
matching results. The descriptors are rotation invariant, but they are
slower to compute than most alternatives, which makes them too
expensive for realtime applications. SURF points are designed on
similar principles [35] but are optimized for speed, making them a good
alternative [46,8].


Other interesting features like Maximally Stable Extremal Regions
(MSER) [55] or FAST corners with BRIEF descriptors [56,26] have been
used successfully. Feature matching provides accurate and fast results,
and the obtained trajectories allow for additional temporal analysis in
the remaining steps of the process. One limitation of feature matching
methods is that scenes with large uniform regions can sometimes yield
few features per frame.


3. Real time Embedded Video Stabilization Systems 
used for different applications 

This section primarily focuses on works that were carried out with
the realtime application of the end product in mind and that dealt with
a hardware implementation of the proposed algorithm. FPGA based
prototyped implementations can give a realistic analysis of the
real-time usability of a VS algorithm, as the algorithm can be analyzed
for computational intensiveness, resource utilization, speed, accuracy
and output quality when implemented on a prototyping platform. Most of
these works have been targeted at end applications like UAVs, off road
vehicles and robots, which require different approaches and make use of
varied computational resources in order to be implementable on embedded
systems in real time. However, most of the referred methods involve
finding a 2D motion model to estimate the global motion path. The path
is then low pass filtered to remove the high frequency jitter
component, and the resulting low frequency parameters are applied to
the frames. This mechanism has been found to be useful for sequences
with minimal dynamic movement, which is critical for stabilizing UAV
aerial videos. However, it becomes complicated when immediate
stabilization is required, as in realtime processing, where the global
motion path can only be estimated from a given window of frames.
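
The path smoothing step described above can be sketched for a single
motion parameter, for example horizontal translation. A centered moving
average is used here as one simple low pass filter. The filter choice,
the window radius and the function names are assumptions, and the
referred works may use other filters.

```python
def cumulative_path(frame_motions):
    """Accumulate frame-to-frame motions into a global motion path."""
    path, total = [], 0.0
    for motion in frame_motions:
        total += motion
        path.append(total)
    return path


def moving_average(path, radius=5):
    """Low pass filter the path with a centered window of 2*radius + 1
    samples, truncated at the ends of the sequence."""
    smoothed = []
    for i in range(len(path)):
        lo = max(0, i - radius)
        hi = min(len(path), i + radius + 1)
        window = path[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def roughness(path):
    """Sum of squared second differences, a simple measure of jitter."""
    return sum((path[i + 1] - 2 * path[i] + path[i - 1]) ** 2
               for i in range(1, len(path) - 1))


# Synthetic input: a steady pan of 1 px/frame plus alternating jitter.
frame_motions = [1.0 + (2.0 if i % 2 == 0 else -2.0) for i in range(60)]
raw_path = cumulative_path(frame_motions)
smooth_path = moving_average(raw_path, radius=5)
# Per-frame offset that moves each frame onto the smoothed path.
corrections = [s - r for s, r in zip(smooth_path, raw_path)]
print(roughness(raw_path), roughness(smooth_path))
```

The centered window looks ahead by `radius` frames, which is the
buffering delay that makes immediate stabilization difficult. A strictly
causal filter would only use past frames at the cost of lag in the
smoothed path.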


In [26], the Oriented FAST and Rotated BRIEF (ORB) algorithm was
utilized for feature extraction, followed by matching between
consecutive frames. Here, interframe motion was computed using Random
Sample Consensus (RANSAC). The authors proposed a framework to
implement the whole feature based video stabilization process on a
single FPGA chip in order to achieve real time performance. A fully
pipelined FPGA architecture was proposed to substantially accelerate
feature based video stabilization in a highly parallel manner, which
also provides a reference for accelerating other feature based video
processing tasks such as object tracking and video fusion on FPGA. The
authors in [27] worked on a design specifically intended for real time
embedded systems, where the present and previous video frames were used
to estimate the motion vector, followed by filtering operations. The
implementation was done on a Xilinx Spartan 6 LX45 board and was tested
on the output of a 640x480 pixel IR camera. In Vazquez and Chang [28],
a smoothing operation was done using the Kanade-Lucas tracker to detect
interest points. The undesired motion was compensated by adjusting for
the additional rotation and displacements due to vibrations. The
stabilizing speeds achieved through this mechanism were in the range of
20 fps to 28 fps for images with resolutions of the order of 320x240
pixels. The proposed system was finally tested on a Mac with a 2.16 GHz
processor clock and a three frame delay.


Wang [29] suggested a video stabilization method specifically
designed for UAVs. He proposed a 3 step mechanism, starting with a FAST
corner detector to locate the feature points in the frames. In the
second step, after matching, the key points were used to estimate an
affine transform and to separate out false matches. [30] discussed
video stabilization implemented on an FPGA for a driven mobile robot.
Further works include [31], where a real time video stabilization
system was implemented using a high frame rate jitter sensor and an HS
camera to retrieve feature points in gray level 512x496 image sequences
at 1000 fps. In [32], the stabilization work is based on local
trajectories and robust mesh transformation. Every frame is refined in
a different local trajectory matrix, which helps in modeling a
nonlinear video stream. The method is reported to perform better than
conventional methods based on feature trajectories. L. M. Abdullah, N.
Md Tahir et al. [36] use a system where the Corner Detector System
Object is used to find corner values using Harris corner detection,
which is one of the fastest algorithms for finding corner values. After
the salient points of each frame are obtained, the correspondences with
the previously identified points need to be determined. The authors in
[33] presented a feature based approach for video stabilization that
produces stabilized videos while preserving the original resolution.
Lowe's method is used to extract the SIFT feature points for every
frame in the video. Fischler and Bolles's RANSAC algorithm is used in
the final fine stage to screen out mismatched feature pairs. Then, the
top eight pairs of feature points are chosen to form a linear system
for deriving the perspective transformation between two consecutive
frames. Another work in line with the real time implementation of video
stabilization in hardware is presented in [34]. Here, an embedded real
time video stabilization system targeted toward a Xilinx Virtex-5 FPGA
board is given. The horizontal and vertical global video frame
movements are calculated by making use of an integral projection
matching approach.
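
Integral projection matching reduces the two dimensional search to two
one dimensional searches over row and column sums. The sketch below
illustrates the idea only. The synthetic scene, the mean absolute
difference criterion and the function names are assumptions and are not
taken from [34].

```python
def make_scene(width, height, dx=0, dy=0):
    """Dark frame with two bright rectangles, translated by (dx, dy)."""
    frame = [[0] * width for _ in range(height)]
    rectangles = [(20, 18, 12, 9, 200), (34, 30, 8, 14, 120)]
    for x0, y0, w, h, value in rectangles:
        for y in range(y0 + dy, y0 + dy + h):
            for x in range(x0 + dx, x0 + dx + w):
                frame[y][x] = value
    return frame


def projections(frame):
    """Return (row sums, column sums) of the frame."""
    rows = [sum(row) for row in frame]
    cols = [sum(col) for col in zip(*frame)]
    return rows, cols


def best_shift(ref, cur, max_shift):
    """Shift s minimizing the mean |cur[i + s] - ref[i]| over the overlap."""
    n = len(ref)
    best_s, best_cost = 0, None
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(n, n - s)
        cost = sum(abs(cur[i + s] - ref[i]) for i in range(lo, hi)) / (hi - lo)
        if best_cost is None or cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


prev_frame = make_scene(64, 64)
cur_frame = make_scene(64, 64, dx=4, dy=-3)
prev_rows, prev_cols = projections(prev_frame)
cur_rows, cur_cols = projections(cur_frame)
# Column sums give the horizontal shift, row sums the vertical shift.
est_dx = best_shift(prev_cols, cur_cols, max_shift=8)
est_dy = best_shift(prev_rows, cur_rows, max_shift=8)
print(est_dx, est_dy)
```

Each frame contributes only width + height projection values to the
matching step, which keeps memory and arithmetic low on an FPGA. The
approach recovers translation only and does not model rotation or
scaling.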


4. Advantages of using Feature based extractors like 
SIFT and SURF for real time applications 
Among the available descriptors, the focus here is primarily on works
using SIFT and SURF descriptors, especially when it comes to real time
stabilization. This is because their accuracy has been good and, more
importantly, because they are reported to be computationally efficient.
Scale invariant feature transform (SIFT) extracts and connects feature
points in images which are invariant to image scale, rotation and
changes in illumination. Moreover, it provides distinctive descriptors
that can be used to find the correspondences between features in
different images. Because of these advantages, it is very suitable for
estimating motion between images and hence can be deployed very
effectively for motion estimation in video stabilization.


However, as per the referred literature, although SIFT has achieved
remarkable success in video stabilization, it suffers from
comparatively costly computation, especially for realtime applications.
This leads to the need for viable replacements with a much lower
computational cost. Among the descriptors compared, one of the best of
these replacements has been found to be Speeded Up Robust Features
(SURF).


SURF relies on the Hessian matrix because of its good performance in
terms of both computation time and accuracy. Rather than using
different measures for selecting the location and the scale, as the
Hessian-Laplace detector does, SURF uses the determinant of the Hessian
matrix for both. For a particular pixel, the Hessian matrix can be
given as:








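$$H(f(x, y)) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \, \partial y} \\ \dfrac{\partial^2 f}{\partial x \, \partial y} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} \qquad (4)$$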


For the purpose of adaptive scaling, the image is then 
filtered making use of a Gaussian kernel, so that for a point 
X = (x, y), the Hessian matrix becomes defined as: 


$$H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{xy}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix} \qquad (5)$$


where Lxx(X, σ) is the convolution of the Gaussian second order
derivative with the image I at point X, and likewise for Lxy and Lyy.
Gaussians are optimal for scale space analysis, but in practice they
have to be discretized and cropped. Due to this there is an observed
loss in repeatability under image rotations around odd multiples of
π/4. This is a general weakness of Hessian based detectors.


Despite this, the detectors have still been found to perform well,
and this slight loss in performance does not outweigh the advantage of
the fast convolutions made possible by discretization and cropping.
Computing the determinant of the Hessian matrix requires convolution
with second order derivatives of a Gaussian kernel. Lowe used LoG
approximations with considerable success. SURF pushes the approximation
of both the convolution and the second order derivative even further
with the use of box filters. The approximated second order Gaussian
derivatives can be evaluated at a very low computational cost using
integral images, independently of the filter size. This gives the SURF
algorithm its fast computation and makes it well suited to the
applications considered in this work.
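
The size independence comes from the integral image, in which the sum
over any rectangle needs only four lookups. The following is a minimal
sketch of that property. The helper names and the small test image are
assumptions for illustration and are not part of SURF itself.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of the image over x0 <= x < x1 and y0 <= y < y1 (four lookups)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def brute_force_sum(img, x0, y0, x1, y1):
    """Reference implementation that visits every pixel in the box."""
    return sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))


image = [[(x * y + 3 * x + y) % 17 for x in range(40)] for y in range(40)]
ii = integral_image(image)
small_box = box_sum(ii, 5, 5, 14, 14)    # 9 x 9 region
large_box = box_sum(ii, 5, 5, 32, 32)    # 27 x 27 region, same cost
print(small_box, large_box)
```

A box filter response is a weighted combination of a few such box
sums, so a 27x27 filter costs the same number of operations as a 9x9
filter once the integral image is available.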


The 9 x 9 box filters approximate Gaussian second order derivatives
with σ = 1.2 and represent the lowest scale, i.e. the highest spatial
resolution, for computing the blob response maps. The approximations
are denoted by Dxx, Dyy and Dxy. The weights applied to the rectangular
regions are kept simple to minimize computation. This finally produces
the approximated determinant of the Hessian as:


$$\det(H_{approx}) = D_{xx} D_{yy} - (w D_{xy})^2$$

with w = 0.9 (Bay's suggestion), or

$$\det(H_{approx}) = D_{xx} D_{yy} - (0.9 \, D_{xy})^2 \qquad (6)$$


Making use of box filters and integral images, SURF does not have to
iteratively apply the same filter to the output of a previously
filtered layer, but can instead apply filters of various sizes at the
same speed directly on the original image, even in parallel. Therefore,
the scale space is analyzed by upscaling the filter size (9x9 to 15x15
to 21x21 to 27x27 etc.) instead of iteratively reducing the image size.
An octave is a series of filter response maps obtained by convolving
the input image with filters of increasing size. Each octave spans a
scaling factor of 2 and is subdivided into a constant number of scale
levels. Due to the discrete nature of integral images, the minimum
scale difference between 2 subsequent scales depends on the length l0
of the positive or negative lobes of the partial second order
derivative in the direction of derivation (x or y), which is fixed to a
third of the filter size. For the 9x9 filter, this length is 3. For two
successive levels, the lobe length is increased by a minimum of 2
pixels in order to keep the size uneven and ensure the presence of a
central pixel, which increases the mask size by 6 pixels. For
localizing interest points in the image and over scales, non maximum
suppression in a 3 x 3 x 3 neighborhood is applied. The subsequent
layers are obtained by filtering the image with gradually larger masks.
The biggest advantage of this type of sampling is its computational
efficiency. Also, as the image does not have to be downsampled, there
is no aliasing. On the flipside, box filters tend to preserve high
frequency components that are lost in rescaled views of the same scene,
which can limit scale invariance.
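
The filter size progression can be made concrete with a short sketch.
The first octave (9, 15, 21, 27) and the base scale of 1.2 for the 9x9
filter come from the text above. The doubling of the size increment in
later octaves and the choice of each octave's first filter follow the
usual description of SURF and should be read as assumptions here, as
should the function names.

```python
def surf_filter_sizes(octaves=3, levels=4):
    """Box filter side lengths per octave, assuming the step doubles
    with each octave (6, 12, 24, ...)."""
    sizes = []
    for octave in range(octaves):
        step = 6 * 2 ** octave
        # Later octaves start at the second filter of the previous octave.
        first = 9 if octave == 0 else sizes[octave - 1][1]
        sizes.append([first + level * step for level in range(levels)])
    return sizes


def filter_scale(size, base_size=9, base_scale=1.2):
    """Scale associated with a filter size, relative to the 9x9 filter."""
    return base_scale * size / base_size


def lobe_length(size):
    """Lobe length l0, fixed to a third of the filter size."""
    return size // 3


filter_sizes = surf_filter_sizes()
for octave_sizes in filter_sizes:
    print([(s, lobe_length(s), round(filter_scale(s), 2)) for s in octave_sizes])
```

Every size in the table is odd and divisible by three, so each filter
keeps a central pixel and an integer lobe length, which is the
constraint that forces the minimum increment of 6 pixels.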


SURF descriptor creation is normally done as a two step mechanism.
The first step fixes an orientation on the basis of information from a
circular region around the keypoint. In the second step, a region is
constructed, aligned to the chosen orientation, and the SURF descriptor
is extracted from it. For rotational invariance, SURF identifies an
orientation for each interest point. To achieve this, SURF estimates
the Haar wavelet responses in the x and y directions. The sampling step
is scale dependent and is chosen to be s, with the wavelet responses
being computed at the current scale s. Accordingly, the wavelets become
large at high scales, so integral images are again employed for fast
filtering. Finally, the sum of the horizontal and vertical wavelet
responses is computed within a sliding orientation window. The window
is rotated and the sum recalculated until the orientation with the
largest sum is found, which becomes the main orientation of the feature
descriptor. However, this step places an extra computational load on
the limited resources available in applications like offroad vehicles
and UAVs. Hence, in order to further optimize the SURF algorithm and
reduce its complexity for realtime applications, a curtailed version of
SURF termed U-SURF (Upright SURF) can be employed. The authors in [35]
proposed this upright version of the descriptor, which skips the
orientation assignment and is therefore not invariant to image
rotation. It is faster to compute and better suited for applications
where the camera remains more or less horizontal. Hence, U-SURF has
been found to be better suited for the realtime applications discussed
here.


5. Conclusions 

Motion estimation and digital video stabilization have been achieved
using multiple mechanisms. However, for realtime performance,
considerable performance enhancement has been achieved with the use of
feature based motion estimation mechanisms like SURF. We have seen that
such systems, and more specifically embedded systems, have shown better
performance in terms of output quality, frame processing speed and
utilization of resources.


SURF as a descriptor has shown a good tradeoff between accuracy and
speed. Most of the other descriptors, including SIFT, had better
accuracy than SURF. However, we still preferred SURF over the others
because speed was more important for ensuring realtime functionality.
An even faster variant of SURF, U-SURF, can be used to further enhance
speed, taking into account the fact that most of the applications
discussed in this paper involve more significant translational motion
than rotational motion. The architecture of the real time embedded
system can be further optimized for even quicker performance by
pipelining the different stabilization system blocks, so that the input
frames can be fed and processed block by block.

Y. Wang, Q. Huang [13] | Block Motion Estimation | 2D | Use of SAD in place of MSE, SSD reduces computational intensity. | due to low pass filtering, no performance metrics evaluated
W.-x. Jin, X.-g. Di [37] | Block Motion Estimation | 2D | Bit Plane Matching, performs well even for large angle rotational | Estimation of translational after rotational motion
Yangke Liu | Feature Based | 2D | |
Xie, Q. & Chen [49] | estimation (K Means Clustering) | 2D | Applicable to handheld devices with optimal resources | the tiny block matching suffers from computation cost and complex motion vector classification
A. J. Amiri and H. Moradi [a) | SURF | 2D | Use of BRISK as detectors along with RANSAC | LotpS ony
C. Fang, T. Tsai [50] | Block Matching Estimation | 2D | Use of rotation-based block matching method for local motion estimation | Not realtime
r Chen, S. Lin [33] | SIFT | 2D | Full frame Stabilization | Not realtime
M. Wang [51] | Deep Neural Networks | 2D | Development of StabNet Database | Frames need to be buffered Sea ctoiad
Pat Medaand | Global Motion | | Better MSE results can be